ABSTRACT


Direct  measurements  on  a  VAX/VMS  system  reveal  that  program  behavior  has  a 
significant effect on the performance of this system. For a monoprogrammed batch
workload, the turnaround time of a job whose virtual space fits in physical memory
can be reduced by up to 50% if its behavior is improved. For larger
jobs the improvement can reach a factor of 100.

In  a  multiprogramming  batch  environment  improving  the  behavior  of  programs 
increased  the  throughput  of  the  system  by  up  to  64%  for  balanced  workloads,  up  to 
400%  for  I/O  bound  workloads  and  up  to  419%  for  mixes  of  balanced  and  I/O  bound 
workloads.  Improving  the  program  behavior  also  reduces  the  overhead  time  of 
automatic  memory  management.  This  was  measured  to  reach  up  to  83%. 

This  case  study  points  towards  the  more  general  conclusion  that  program  behavior 
has  a  significant  influence  on  computer  system  performance  even  with  the  abundance  of 
hardware resources available now and in the future.

1. Introduction

Recent  trends  in  implementing  virtual  memory  operating  systems  in  the  real  world 
assume  an  abundance  of  hardware  resources  --  mainly  physical  memory  and  secondary 
storage. With the drastic drop in the cost of memory systems over the last decade, this
assumption is valid for minicomputers and more powerful machines [PoAg83], at least
relative to the resources available in the late sixties or early seventies. System designers consider
these  hardware  resources  to  be  the  primary  constraints  on  system  performance  (for  a 
given  CPU  speed)  [LeEc80]. 

Early  studies  pointed  out  that  another  factor  controlling  the  performance  of  virtual 
memory  computers  is  program  behavior  [BrGu68],  [Denn68a],  [KuLa70],  [Elsh74], 
[Ferr76]. However, in the sixties and seventies, the concern was to obtain satisfactory
performance with the least possible physical memory on the machine. With the drastic
drop in memory cost, one wonders whether program behavior is of any significance at
present or in the future. This paper is an attempt to answer this question.

Traditionally, most empirical studies of program behavior have used trace-driven
simulations (see, for example, [AlHK80], [ALMY82], [HaPo83], [LaFe83]). A program of
interest  is  executed  interpretively,  a  record  is  made  of  each  of  its  memory  references, 
and  the  address  trace  that  results  is  used  to  drive  a  simulator  of  a  certain  environment. 
The simulation approach has many advantages, mainly exact reproducibility and ease in
changing the parameters of the environment being simulated. However, simulations have
a major drawback: it would be difficult to use trace-driven simulation to account accurately
for the effects of various aspects of modern and future computer systems in our
study. Examples of these aspects are multiprogramming, the execution of operating
system routines for a nontrivial percentage of CPU time, and I/O interference. Our
investigation is therefore done by direct measurements on a real machine. A similar approach
was followed recently in a study to evaluate the performance of a cache system [Clar83].
That study showed a noticeable difference between the cache performance
anticipated by simulation studies [Smit82] and that obtained by direct measurement
on a real machine.

The case study we use is the DEC VAX/VMS system [Stre78], [LeEc80].
Although the exact reproducibility offered by simulation is lost in our
approach, the margin of error is small (we discuss this issue in more detail in Section 2).
The advantage of our approach is that conclusions are based on measurements of a
real machine with all the complexities of its architecture and operating system. Choosing
a specific machine and its operating system may seem to limit the generality of our
conclusions. This may be true for the quantitative parts of our conclusions; however, a
case study is one legitimate way of exploring the effect of program behavior on system
performance. This is particularly so because the designers of this system explicitly declare
that they assume an abundance of hardware resources. This environment is therefore suitable
for exploring the answer to our question: what degree of influence does program
behavior still have on the performance of a system with an abundance of resources?

We performed experiments on the VAX/VMS system using two simple programs. The
behavior of these programs can be easily changed through one simple transformation,
the loop interchange transformation. The improvement in the behavior of these programs
varies from drastic to moderate. Due to the simplicity of the programs, we
transformed them manually. However, the automation of this and other transformations
has been implemented in the PARAFRASE system of the University of Illinois
[Leas76], [Wolf78], [AbKL79], [KKLW80], [AbKL81], [Wolf83].

In Section Two we discuss our experimental process: the programs used, their
transformed versions, the performance measures used, and the experiments conducted. In
Section Three we present and discuss the results. In Section Four we present our concluding
remarks concerning the questions raised in this paper.

2.   The  Experimental  Environment 

The  following  is  a  brief  description  of  the  environment  in  which  these  experiments 
were performed. The computer used is a VAX 11/750 running version 3.3 of DEC's
operating system, VMS. The virtual memory page size of this system is 512 bytes and
the system has 4096 pages of real memory. The operating
system  itself  occupies  approximately  900  pages  of  main  memory.  VMS  uses  a  local 
FIFO  replacement  algorithm. 

Each user process on the system is given a certain set of main memory pages, called
the resident set, on which to execute (in DEC literature this is called the Working Set of
the process). VMS gives each process an initial resident set of size WSDEFAULT (this
parameter can be set by the user) and then dynamically changes the amount
of memory of the process in response to the process' paging rate and the amount of free
memory left in the entire system. If the paging rate is above a certain level, PFRATH,
the operating system increases the size of the process' resident set by WSINC; if, however,
the paging rate is below a certain level, PFRATL, the operating system decreases
the amount of memory for the process by WSDEC. The maximum size of the resident set
of any user process is bounded above by the system parameter WSMAX. Additionally,
each user process has its own upper limits on the size of its resident set. WSEXTENT is
the maximum possible resident set size for the process, while WSQUOTA is the maximum
guaranteed resident set size for the process (WSQUOTA must be less than or
equal to WSEXTENT). The resident set of a process may exceed its WSQUOTA only
when there are more than BORROWLIM free pages in the system. The size
of the virtual space (in pages) associated with each process is denoted by PVWS. The
maximum number of physical pages occupied by a process during its lifetime is denoted
by PWSS.
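The adjustment rule described above can be sketched as follows. This is only an illustration of the policy as stated in the text, not the VMS implementation: the function name, the default of 4096 pages for WSMAX (Table 1 does not list it), and the lower bound of zero pages when shrinking are assumptions made for the example. The other defaults are taken from Table 1 and from the process settings used in experiment one.

```python
def adjust_resident_set(size, fault_rate, free_pages, *,
                        pfrath=200, pfratl=100,   # faults per 10 seconds (Table 1)
                        wsinc=150, wsdec=35,      # pages (Table 1)
                        wsquota=1500, wsextent=1500,
                        borrowlim=300,
                        wsmax=4096):              # assumed value, not given in Table 1
    """Return the resident-set size (in pages) after one adjustment step.

    fault_rate uses the same units as PFRATH and PFRATL (faults per 10 seconds).
    """
    if fault_rate > pfrath:
        # Growth past WSQUOTA is allowed only while more than BORROWLIM
        # pages are free; otherwise WSQUOTA is the ceiling.
        limit = wsextent if free_pages > borrowlim else wsquota
        limit = min(limit, wsmax)
        # Never shrink here: a process already above the limit keeps its size.
        return max(size, min(size + wsinc, limit))
    if fault_rate < pfratl:
        # Assumed lower bound of zero pages; the text does not state a minimum.
        return max(size - wsdec, 0)
    return size


if __name__ == "__main__":
    # A process starting at WSDEFAULT = 250 pages and faulting heavily.
    print(adjust_resident_set(250, fault_rate=300, free_pages=1000))
```

With these defaults a heavily faulting process grows in steps of 150 pages until it reaches 1500 pages, while a quiet process shrinks by 35 pages per step.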

In addition to the pages allocated to user processes, the operating system keeps a
certain amount of memory free in the free page list and the modified page list to act as
a page cache. When a page is released from the resident set of a process, it goes into the
modified page list if it was modified (and thus requires a disk write); otherwise
it goes into the free page list. The operating system
keeps the size of the free page list above FREELIM pages and makes sure it is at least
FREEGOAL pages large after each freeing of pages from user processes (freeing in
response to a memory shortage). The maximum size of the modified page list is
MPW_HILIMIT and the minimum size is MPW_LOLIMIT. If a process faults and the
page is in either list, the page is returned to the process' resident set. Such a page fault
is relatively inexpensive since it can be satisfied without a disk I/O request.
Besides being part of the page cache, the modified page list acts as a staging buffer for
the clustering of disk writes. This clustering serves to reduce the amount of disk I/O.
Pages from the modified page list are written out of memory in clusters of
MPW_WRTCLUSTER pages. For each page fault requiring a disk read, a cluster of
PFCDEFAULT virtually contiguous pages is read into the faulting process' resident
set.
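A minimal sketch of the page cache described above may help to separate the cheap faults from the expensive ones. The class and method names are hypothetical, and the flush policy (write clusters whenever the modified list exceeds MPW_HILIMIT, until it is back at MPW_LOLIMIT, with written pages moving to the free list) is an assumption; the text only states the two limits and the cluster size. The FREELIM and FREEGOAL thresholds are not modelled.

```python
from collections import deque


class PageCache:
    """Toy model of the free page list and the modified page list."""

    def __init__(self, hilimit=500, lolimit=100, wrtcluster=96):
        self.free = deque()
        self.modified = deque()
        self.hilimit = hilimit
        self.lolimit = lolimit
        self.wrtcluster = wrtcluster
        self.disk_writes = 0  # number of clustered write requests issued

    def release(self, page, dirty):
        """A page leaves a resident set; dirty pages go to the modified list."""
        (self.modified if dirty else self.free).append(page)
        if len(self.modified) > self.hilimit:
            self._flush()

    def _flush(self):
        # Assumed policy: write clusters until the list is back at the low limit.
        while len(self.modified) > self.lolimit:
            count = min(self.wrtcluster, len(self.modified) - self.lolimit)
            for _ in range(count):
                # Once written, a page is clean and moves to the free list.
                self.free.append(self.modified.popleft())
            self.disk_writes += 1

    def fault(self, page):
        """Return 'soft' if the page is still cached in memory, else 'hard'."""
        for page_list in (self.free, self.modified):
            if page in page_list:
                page_list.remove(page)  # the page returns to the resident set
                return "soft"
        return "hard"  # a disk read is needed


if __name__ == "__main__":
    cache = PageCache(hilimit=4, lolimit=1, wrtcluster=2)
    for p in range(5):
        cache.release(p, dirty=True)
    print(cache.disk_writes, cache.fault(4), cache.fault(99))
```

In this model a soft fault costs no disk I/O, which is why the page fault counts reported later mix cheap and expensive faults.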

VMS also swaps entire working sets between memory and disk. VMS checks the
nonresident executable queues to find the highest priority process to be swapped in.
Once a process is selected, the operating system must find enough free pages to hold the
process' resident set. There are three ways to obtain these free pages: the first is to take
them from the free list, the second is to do a disk write from the modified page list and
thus free those pages, and the third is to swap out a process of lower or equal priority.
Once swapped in, the process is guaranteed at least one quantum before it becomes eligible
to be swapped out.

For a more detailed description of memory management in VMS see [LeLi82].
Table 1 shows the values of the system parameters used in this installation. The values
assigned to these parameters are those chosen by the system manager of the site to suit
the workload of the machine. The site is a software house with a day workload consisting
mainly of the interactive development of vectorizing compilers for supercomputers,
while at night the production runs consist of compilation jobs. We were not free to
vary the values of the system parameters, nor did we intend to change them. We felt
that for the investigation in this paper we should not be concerned with
tuning issues. The designers of the system do not advocate putting a lot of
effort into tuning the system. Instead, they advocate the use of the default values for the
system parameters while adding more hardware resources whenever the workload outgrows
the system [DEC82]. This implies that with sufficient hardware resources, the system
performance is satisfactory with the default system parameter values. This paper
examines this claim from the point of view of program
behavior. Other researchers have shown that tuning an early version of this system
drastically improved its performance [Lazo79].

Table 1. System Parameters

Parameter         Value
BORROWLIM         300 pages
FREEGOAL          500 pages
FREELIM           100 pages
MPW_HILIMIT       500 pages
MPW_LOLIMIT       100 pages
MPW_WRTCLUSTER    96 pages
PFCDEFAULT        32 pages
PFRATH            200 faults/10 secs
PFRATL            100 faults/10 secs
QUANTUM           200 ms
WSINC             150 pages
WSDEC             35 pages

The following two programs (coded in FORTRAN) were used for the experiments.
The first program, ADD, sums up the values of each row of a square matrix. The second
program, MAD, is a matrix addition of two square matrices. In these programs, the
matrices are referenced by rows. Additionally, we have transformed versions of each of
the programs, called TADD and TMAD. The transformation applied to these programs is
loop interchange. Due to the loop interchange transformation, the matrices are referenced
by columns. Since FORTRAN stores arrays in column-major order, the column-wise
references of the transformed programs touch consecutive memory locations. Each of these
programs was compiled in eight versions,
distinguished by the problem size. <PROG>1 is the version with a 128 by 128 matrix,
<PROG>2 for 256 by 256, <PROG>3 for 384 by 384, up to <PROG>8 for 1024 by
1024, where <PROG> is one of {ADD, MAD, TADD, TMAD}.
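The effect of loop interchange on locality can be illustrated with a small sketch. It is not the FORTRAN source of ADD or TADD, which the text does not list. It models only the references to the matrix, laid out in column-major order as in FORTRAN, and counts how often two consecutive references fall on different pages. The function names are hypothetical; the 128 words per page follow from the 512-byte page size and 4-byte real words.

```python
WORDS_PER_PAGE = 128  # 512-byte pages holding 4-byte real words


def page_of(i, j, n):
    """Page holding element (i, j), 0-based, of an n-by-n column-major array."""
    return (j * n + i) // WORDS_PER_PAGE


def page_switches(n, row_wise):
    """Count how often consecutive matrix references land on a different page.

    row_wise=True models the original loop order (inner loop walks along a row);
    row_wise=False models the interchanged order (inner loop walks down a column).
    """
    switches = 0
    last_page = None
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if row_wise else (inner, outer)
            page = page_of(i, j, n)
            if page != last_page:
                switches += 1
                last_page = page
    return switches


if __name__ == "__main__":
    n = 128  # problem size of version 1
    print("row-wise (original):      ", page_switches(n, row_wise=True))
    print("column-wise (transformed):", page_switches(n, row_wise=False))
```

For the 128 by 128 case each column occupies exactly one page, so the row-wise order changes page on every reference while the column-wise order changes page only once per column.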

The  information  about  the  resource  usage  of  each  of  the  programs  is  reported 
through the accounting log files. This log contains the time at which the process terminated,
the number of I/O requests serviced, the number of page faults, the peak size
of  the  resident  set  during  the  execution  of  the  process  (PWSS),  the  peak  virtual  memory 
space  in  pages  (PVWS)  allocated  to  the  process,  the  elapsed  CPU  time,  and  the  elapsed 
real  time. 

The first experiment performed was to run each program in a monoprogramming
batch mode. Each program was run once for each size of the data array, both for the
original version and for the transformed version. The elapsed time for each program
was noted, and the ratio between the times of the original and transformed versions of
corresponding sizes was computed. The purpose of this experiment is to illustrate the
effectiveness of the transformations on programs that would normally be monoprogrammed
on a VAX machine. Programs with large array sizes seem to
reflect such a workload. The processes created for this experiment have the following
parameter settings: WSDEFAULT is set to 250 pages, WSQUOTA is set
to 1500, and WSEXTENT is set to 1500.

The second experiment examines the effectiveness of the transformation on programs
that would normally be run in a multiprogramming environment. The original
and transformed versions of the programs at a smaller data array size were used in these
experiments. Each version of the program was multiprogrammed with the
multiprogramming level (MPL) varying from one to six. MPL copies of the program
were started at the same time. The time per job at each MPL was compared
between the original and transformed versions of the program. We also multiprogrammed
the two programs ADD and MAD by submitting three copies of each to the system. A
similar run was done for the transformed programs. The processes created for this
experiment have the same parameter settings as those used for experiment one.

The  third  experiment  attempts  to  measure  the  reduction  of  automatic  memory 
management  cost  due  to  improved  program  behavior.  This  is  done  by  measuring  the 
time  of  running  an  original  program  versus  a  transformed  program  when  physical 
memory  allocated  to  the  process  of  the  original  program  is  greater  than  the  virtual  space 
needed by the process. The original version of the program was run with WSDEFAULT
= WSQUOTA = WSEXTENT set to a value greater than the PVWS of the
program. The transformed program was run three times, once for each of the following
conditions: (i) WSDEFAULT = WSQUOTA = WSEXTENT = PVWS, (ii) WSDEFAULT
= WSQUOTA = WSEXTENT = PWSS, and (iii) restricted memory (WSDEFAULT
= WSQUOTA = WSEXTENT = 213). ADD3 and its transformed counterpart
were picked for these experiments. This is the largest version of ADD whose PVWS
can fit entirely in physical memory.

In  order  to  make  the  conditions  for  all  experiments  as  controllable  as  possible,  the 
experiments  were  run  only  when  no  other  user  processes  were  executing.  However,  some 
measure  of  uncertainty  is  unavoidable.  One  of  the  reasons  comes  from  the  interference 
caused  by  the  execution  of  the  operating  system  routines.  The  experiments  were  run 
from a batch queue that operated when all users were logged off. Synchronization within
this queue was done in order to ensure that no other pending job would start until the
current experiment completed.

Besides the operating system processes, there is a process that executes at the time
of the experiments and controls the submission of jobs to be run. This process introduces
a slight interference with the execution of the experiments. Also, the job controlling
the simultaneous submission of multiprogrammed jobs actually starts the jobs
serially. The total error introduced by delays in submitting jobs and by the interference of
the job-submitting program accounts for some of the slight perturbations in the results
between repeated instances of the experiments. These differences have been computed in
several reruns of several of the experiments and have been found to be of no significant
consequence.

3.    Experimental  Results 

3.1.   Experiment  One 

As mentioned in Section Two, the purpose of this experiment is to investigate the
effect of program behavior on the performance of a monoprogrammed VAX/VMS system.
It is the recommendation of the designers of the VAX/VMS to run large production
jobs in a batch monoprogramming mode [DEC82]. In fact, these types of jobs are
executed in this manner. The effect of program behavior on the performance of the system
in such an environment is probably best measured by its effect on the turnaround
time of jobs. Other measures of interest are the PWSS and the number of page faults
generated. We now discuss the results we obtained for programs ADD and MAD.

3.1.1. Program ADD

Table 2 shows the space statistics for this program for different array sizes. For the
version where the two-dimensional array is 128 by 128 (version 1), the space occupied by
array elements is 129 pages (a page holds 128 real words, hence 128 pages are occupied
by the two-dimensional array and one page by the result vector). For a matrix size
of 1024 by 1024 (version 8) we need 8*1024 + 8 = 8200 pages. Note that the code and
the scalar storage requirements are negligible. Hence, the 149-page difference between
the occupied virtual process space, PVWS, and its array data pages is the part of virtual
space allocated to the process for its control region (for the user mode stack and other
protected process-specific data and code, as well as for the stacks of higher access modes
such as supervisor, executive, and kernel) [LeEc80]. This control space was measured
to be constant for all the programs in all the experiments at 149 pages. This is more
than 50 percent of the total process space for program ADD at an array size of 128 by 128
and is independent of the problem size.
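The space arithmetic above can be reproduced with a few lines of code. This is a sketch under the assumptions stated in the text (128 real words per page, a constant control region of 149 pages, negligible code and scalar storage); the function names are illustrative only.

```python
WORDS_PER_PAGE = 128   # 512-byte pages, 4-byte real words
CONTROL_PAGES = 149    # control region, measured as constant in the text


def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def add_array_pages(n):
    """Pages needed by ADD's n-by-n matrix plus its length-n result vector."""
    return ceil_div(n * n, WORDS_PER_PAGE) + ceil_div(n, WORDS_PER_PAGE)


def add_pvws(version):
    """Estimated virtual process space of ADD<version>, where n = 128 * version."""
    return add_array_pages(128 * version) + CONTROL_PAGES


if __name__ == "__main__":
    for version in range(1, 9):
        n = 128 * version
        print(version, n, add_array_pages(n), add_pvws(version))
```

The estimates agree with the PVWS values quoted elsewhere in the text for ADD2 (663 pages), ADD3 (1304 pages), and ADD5 (3354 pages).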

From Table 2 we notice that for a given problem size the difference between the
PWSS for ADD and TADD is less than 1% for versions 1 through 4 and between 6%
and 21% for versions 5 through 8. Moreover, for some versions TADD.PWSS >
ADD.PWSS and for others ADD.PWSS > TADD.PWSS. This indicates a failure of the
automatic techniques implemented in VMS to take advantage of the extreme difference
between the behavior of ADD and TADD and the difference in their space requirements.
Optimally, TADD would only need a maximum of 16 (taken from analysis of version 8)
array data pages at any given time during its execution, for a total of 165 pages of process
space. It is not within the scope of this paper to discuss why the mechanisms of
VMS failed to distinguish between the behavior of the two programs. The important
observation is that this occurred irrespective of whether the process space can fit in physical
memory or not, and despite the fact that TADD's fault rate is much lower than ADD's fault
rate. This is shown in Table 3.

Table 3 also shows the turnaround time of each program and the ratio of turnaround
times for the original and transformed programs. Note that the improvement in
turnaround time is up to 50 percent for versions 1 to 4. For these programs the system
can  keep  the  pages  of  the  user  process  in  physical  memory  (in  the  process  resident  set, 
free  page  list,  and  modified  page  list).  For  versions  5  to  8,  the  main  memory  of  the 
machine  is  not  large  enough  to  hold  most  of  the  virtual  memory  space  of  program  ADD. 
Disk  page  faults  are  now  more  frequent  than  for  the  smaller  versions  of  ADD  and  the 
improvement  ratio  jumps  to  around  400  percent.  The  table  also  shows  the  improvement 
in the number of page faults. Since the number of page faults includes both cheap and
expensive  ones,  we  feel  that  the  page  fault  ratio  is  less  indicative  of  the  improvement 
than  the  ratio  of  turnaround  times.  Figure  1  shows  a  plot  of  this  ratio  for  program 
ADD  versus  the  result  vector  length  (problem  size).  Notice  the  step  in  the  curve  when 
most  of  the  process'  virtual  space  does  not  fit  in  core. 

Our  conclusion  from  Table  3  is  that  the  behavior  of  programs  like  ADD  is  still  a 
major  factor  in  VAX/VMS  performance  in  a  monoprogrammed  batch  mode.  Improving 
the behavior of programs through compile-time transformations can reduce the turnaround
time by a factor of 4 without the need for expanding the main memory size.
Even if the main memory size were increased to adequately hold almost all of the virtual
process space, program transformation can reduce turnaround time by an appreciable
percentage (50% was measured).

3.1.2.    Program  MAD 

This  program  adds  two  2-dimensional  matrices  and  stores  the  result  in  a  third 
matrix. The space needed for array elements is three times the space needed for ADD;
however, the amount of arithmetic is the same. Thus, one can consider this program to
be  I/O  (or  paging)  bound  relative  to  ADD.  Tables  4  and  5  show  measurements  for  MAD 
similar  to  Tables  2  and  3  for  ADD. 

Table  4  supports  the  conclusion  reached  in  the  previous  section  about  the  inability 
of  VMS  to  differentiate  between  MAD  and  a  drastically  better  behaved  TMAD.  The 
differences  between  the  PWSS  of  the  original  and  transformed  programs  are 
insignificant. 

As  for  the  improvement  in  the  turnaround  time,  we  notice  in  Table  5  that  it  grows 
rapidly  from  a  factor  of  3.3  to  64.88  as  soon  as  the  process  virtual  space  is  too  large  to 
fit  in  core  (versions  3  to  6).  For  programs  that  can  fit  in  core  the  improvement  is  in  the 
range  of  20%.  Figure  2  illustrates  the  pronounced  improvement  that  the  transformed 
version  gives  when  the  program  does  not  fit  in  core.  Note  that  the  amount  of  additional 
memory needed to make this program fit in core is three times what is needed to fit program
ADD in core. In contrast with the drastic memory expansion that would otherwise be needed
to make running such a program practical on such a machine, the transformation approach
allows a drastic improvement in system performance without the need for expanding the
memory. This program should run with 167 pages with almost the same performance
(for version 6).

3.2. Experiment Two

Most  VAX/VMS  systems  are  actually  used  in  a  multiprogramming  interactive 
environment for an appreciable percentage of the time. This kind of workload would
consist of program development, debugging, running system facilities such as the mailing
system, and running interactive systems such as data base systems. In this section
we study the effect of program behavior on system performance when the
machine is multiprogrammed. For a workload with the two programs discussed in this
paper, one would expect jobs run in a multiprogrammed environment to have
smaller array sizes than those run in a monoprogrammed environment. One can think
of  the  runs  in  the  multiprogramming  mode  as  test  runs.  When  numerical  programs  pass 
the  test  runs,  they  are  scheduled  for  execution  with  realistically  large  problem  sizes  in  a 
batch  monoprogrammed  mode. 

3.2.1.   Program  ADD 

We chose ADD2 (256 by 256 matrix) for our multiprogramming experiment. The
total virtual process space for this program is 663 pages, which is reasonable for such an
environment: it is not trivial and at the same time not so large as to require monoprogramming.

Table  6  shows  the  time  per  job  for  MPL  varying  from  one  to  six  for  ADD  and 
TADD.  The  table  also  shows  the  total  virtual  space  of  user  processes  for  each  MPL. 

Examining the time per job for ADD, we notice that the system is being multiprogrammed
in a rather optimal way. The system is operating at the maximal flat portion
of the model curve of throughput (and CPU utilization) versus degree of multiprogramming
[DKLP76] (see Figure 3 and Figure 4). We see that the system is multiprogramming
six processes with a total of about 4000 pages of virtual space while the time per job is
approximately equal to the time for monoprogramming the job. By contrast, Table 3 shows
that version 5 of ADD (with PVWS = 3354 pages) is thrashing. From this we conclude that
multiprogramming is very effective under VAX/VMS. The same conclusion is reached by
examining the time per job for multiprogramming MPL copies of the transformed program
TADD.

Notice  that  the  throughput  of  the  system  improves  when  the  transformed  workload 
is  run.  The  ratio  of  time/job  of  ADD  to  TADD  ranges  from  1.24  to  1.35  with  an  average 
and median of 1.29. Table 7 and Figure 5 present the results for this experiment with
version  3  of  ADD.  The  improvements  are  more  pronounced  in  this  case.  This  workload 
is  an  I/O-CPU  balanced  workload.  Next  we  examine  multiprogramming  an  I/O  bound 
workload. 

3.2.2.   Program  MAD 

As  discussed  earlier  this  program  generates  more  page  faults  and  does  the  same 
amount  of  computation  as  ADD.  Table  8  shows  our  findings  for  multiprogramming  up 
to  six  copies  of  MAD2  and  TMAD2. 

We  notice  that  for  the  untransformed  workload  the  system  is  thrashing  [Denn68b]. 
We are seeing the negative slope part of the throughput (CPU utilization) versus
multiprogramming level function. The time per job increases from .201 minutes in the
monoprogramming case to .819 minutes at MPL=6. This is a factor of four. Improving
the behavior of the workload reduces the thrashing effect significantly. Figure 6 shows
that MAD2 is in the negative slope region of the model multiprogramming curve (Figure
3) while TMAD2 is in the flat region. The ratio of time per job at MPL = 6 to the
uniprogrammed case is 1.6 for TMAD2. Moreover, we see that the ratio of the time per
job for the original workload to that of the transformed workload ranges between 1.20
and 3.82, with an average of 2.63 and a median of 2.82. This is a very significant improvement
in the throughput and utilization of the machine.

3.2.3.   Mix  of  ADD  and  MAD 

In this experiment three copies of ADD5 and three copies of MAD3 were multiprogrammed
together. We repeated this experiment using transformed programs with three
copies of TADD5 and three copies of TMAD3. The total time to finish the original versions
of the programs was 8.25 minutes while the total time to finish the transformed
versions was 1.97 minutes, giving an improvement by a factor of 4.19. Notice that this
improvement is greater than either of the improvements in independently monoprogramming
TADD5 over ADD5 (3.63) or TMAD3 over MAD3 (3.30).

3.3.   Experiment  Three 

In  this  experiment  we  consider  the  question  of  whether  program  behavior  affects 
the  overhead  associated  with  automatic  memory  management.  To  be  conservative  we 
picked  the  more  balanced  program,  ADD,  for  this  experiment  rather  than  the  paging 
bound program MAD. To obtain results whose experimental error percentage is negligible,
we chose the largest version of ADD with memory requirements that can fit in core.
This version of ADD is ADD3 with PVWS=1304 pages.

We ran ADD3 in a monoprogrammed batch mode with WSDEFAULT =
WSQUOTA = WSEXTENT = 1500 pages. Thus, it was allocated real space greater
than its logical space. Then we ran TADD3 with three different settings of these parameters.

(A)  WSDEFAULT  =  WSQUOTA  =  WSEXTENT  =  1500  PAGES 

In  this  case,  both  the  original  and  transformed  programs  were  allocated  the  same 
amount  of  real  memory,  enough  to  exceed  their  logical  space  size.  The  turnaround  time 
of  TADD3  was  .153  minutes  while  for  ADD3  it  was  .280  minutes.  The  ratio  of  the  two 
times  is  1.83.  This  is  a  very  significant  reduction  in  automatic  memory  management 
overhead. 

(B)  WSDEFAULT  =  WSQUOTA  =  WSEXTENT  =  PVWS  (1304  PAGES) 

The real space allocated to TADD3 is exactly equal to its virtual space. The
turnaround time was practically identical to that in the previous case, and hence the
improvement was also identical.

(C)  WSDEFAULT  =  WSQUOTA  =  WSEXTENT  =  213  PAGES 

In this case we reduced the memory allocation of TADD3 to 213 pages. This is enough
space to hold the 149 pages used by the system, a cluster of 32 pages brought in when a
fault on the two-dimensional array occurs, and another 32 pages brought in when a fault
on the result vector occurs. The turnaround time was .155 minutes. The ratio of the
turnaround time of the original program to this time is 1.81. Note that the space has
been reduced by a factor of 1500/213 = 7.0.

From these results we note that the reduction in automatic memory management overhead
due to improved program behavior is very significant.
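
The arithmetic behind cases (A) and (C) can be retraced in a few lines. This is a sketch that uses only the page counts and turnaround times given above; the constant and variable names are ours.

```python
# Working-set arithmetic for case (C), using the page counts given in the text.
SYSTEM_PAGES = 149    # pages used by the system
CLUSTER_PAGES = 32    # pages brought in per fault cluster
CLUSTERS = 2          # one cluster for the array, one for the result vector

case_c_pages = SYSTEM_PAGES + CLUSTERS * CLUSTER_PAGES
space_reduction = 1500 / case_c_pages   # relative to the 1500-page setting

# Turnaround times in minutes, as reported for ADD3 and TADD3.
add3_time = 0.280
tadd3_times = {"A (1500 pages)": 0.153, "C (213 pages)": 0.155}

print(case_c_pages, round(space_reduction, 1))
for case, t in tadd3_times.items():
    print(case, round(add3_time / t, 2))
```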

4.   Concluding  Remarks 

The experimental data presented in this paper show that program behavior can still
have a major influence on computer system performance. The point of this paper is that
program behavior cannot simply be ignored by system designers. It is certainly true
that an abundance of hardware resources (large, fast main memories and high bandwidth
I/O) helps to improve the performance of computer systems. However, these hardware
resources do not eliminate the influence of program behavior. Moreover, user demands on
the memory space and processor speed of computer systems are always greater than what
manufacturers supply, and over time this demand is growing at a more rapid pace than
hardware resources.

The conclusions presented in this paper are based on a case study of a VAX 750/VMS
system. Two simple example programs were used. Both do the same amount of arithmetic;
however, one requires more space and has higher paging activity. The behavior of these
programs can be drastically improved through simple compile time transformations. In a
production, monoprogrammed, batch environment, the turnaround time of untransformed
programs can be 1.5 times that of the transformed programs. This assumes that there is
enough main memory on the machine to hold all of the program's virtual space. When the
logical space of a program outgrows main memory, this ratio can jump to 4.0 for
balanced jobs and approach 100 for paging bound jobs.
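
This section does not reproduce the transformations themselves. As a purely illustrative sketch of why reordering array accesses can change paging behavior, the following counts page faults under LRU replacement for two traversal orders of a column-major array. Everything in it is an assumption made for illustration: the array size, the number of elements per page, the number of page frames, the LRU policy, and the function names. It is not the ADD or MAD program, and it does not model VMS working set management.

```python
from collections import OrderedDict

def count_faults(page_refs, frames):
    """Count page faults for a reference string under LRU with `frames` frames."""
    resident = OrderedDict()   # resident pages, ordered from oldest to newest use
    faults = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults

def page_refs(n, elems_per_page, column_order):
    """Yield the page touched by each access to an n x n column-major array."""
    for outer in range(n):
        for inner in range(n):
            # column_order=True walks down each column (stride 1 in memory);
            # column_order=False walks along each row (stride n in memory).
            row, col = (inner, outer) if column_order else (outer, inner)
            yield (col * n + row) // elems_per_page

N, ELEMS_PER_PAGE, FRAMES = 64, 16, 32    # hypothetical sizes, chosen to run quickly
row_faults = count_faults(page_refs(N, ELEMS_PER_PAGE, False), FRAMES)
col_faults = count_faults(page_refs(N, ELEMS_PER_PAGE, True), FRAMES)
print(row_faults, col_faults, row_faults / col_faults)
```

With these sizes each row touches 64 distinct pages, twice the number of frames, so the row-wise order faults on every one of its 4096 references, while the column-wise order faults once per page (256 faults).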

In a multiprogrammed batch environment, improving the behavior of programs achieved
two goals. First, the system was moved from a thrashing state to a state of maximum
resource utilization. Second, a pronounced improvement in throughput was achieved: a
factor of 1.64 was measured for a balanced load, 3.82 for an I/O bound load, and 4.19
for a mixed load.
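
The balanced and I/O bound factors quoted above are the largest ratios in Tables 7 and 8 respectively. A short sketch that recomputes them from the per-job times in those tables follows; the data layout and the helper name `best_ratio` are ours.

```python
# Time per job in minutes at MPL = 1..6, copied from Tables 7 and 8.
add3 = [.236, .268, .252, .263, .273, .276]
tadd3 = [.144, .178, .160, .162, .170, .169]
mad2 = [.201, .591, .534, .932, .822, .819]
tmad2 = [.168, .216, .246, .244, .281, .283]

def best_ratio(original, transformed):
    """Return (MPL, ratio) for the largest original/transformed time ratio."""
    ratios = [o / t for o, t in zip(original, transformed)]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best + 1, round(ratios[best], 2)

print("balanced (ADD3):", best_ratio(add3, tadd3))
print("I/O bound (MAD2):", best_ratio(mad2, tmad2))
```

Because throughput is the reciprocal of time per job, the ratio of per-job times equals the ratio of throughputs.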

Improving program behavior also reduces the system overhead associated with automatic
memory management. A factor of 1.83 was measured when both the original and transformed
versions of the program were given the same amount of memory. Additionally, a factor of
1.81 was measured when the transformed version used only one seventh of that memory.

Much more extensive measurement and experimentation need to be done to present results
that apply to a wide class of application programs. However, the results presented in
this paper and the more comprehensive work discussed in [AbKL79] and [AbKL81] show that
compile time transformations have real and substantial potential for improving the
performance of paged computer systems.

Table 2   Memory Requirements for Programs ADD and TADD

Version   Size   Data pages (DP)   PVWS   PVWS - DP   ADD PWSS   TADD PWSS
      1    128               129    278         149        230         232
      2    256               514    663         149        583         550
      3    384              1155   1304         149       1221        1226
      4    512              2052   2201         149       1080        1095
      5    640              3205   3354         149       1300        1093
      6    768              4614   4763         149       1241        1388
      7    896              6279   6428         149       1300        1387
      8   1024              8200   8349         149       1404        1103

Table 3   Execution Statistics for Programs ADD and TADD

                 Time (min.)               Pagefaults
Version   Size      ADD     TADD   Ratio      ADD     TADD   Ratio
      1    128     .058     .059     .98      343      343    1.00
      2    256     .115     .087    1.32      918      849    1.08
      3    384     .213     .141    1.51     2016     1811    1.11
      4    512     .363     .254    1.43     4742     4343    1.12
      5    640     1.55     .427    3.63     7342     4013    1.83
      6    768     2.29     .601    3.80    10374     5297    1.96
      7    896     3.21     .806    3.98    13682     6961    1.97
      8   1024     3.98     1.03    3.88    17859     8949    2.00

Table 4   Memory Requirements for Programs MAD and TMAD

Version   Size   Data pages (DP)    PVWS   PVWS - DP   MAD PWSS   TMAD PWSS
      1    128               384     533         149        451         451
      2    256              1536    1685         149       1500        1293
      3    384              3456    3605         149       1500        1387
      4    512              6144    6293         149       1500        1500
      5    640              9600    9749         149       1477        1396
      6    768             13824   13973         149       1500        1500
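
The data pages (DP) columns of Tables 2 and 4 can be reproduced with simple arithmetic. The sketch below assumes 512-byte VAX pages and 4-byte array elements (128 elements per page), one n x n array plus one length-n vector for ADD, and three n x n arrays for MAD. The element size and the array counts are inferences from the tabulated numbers, not statements made in this section; the 149 system pages come from the text. The function names are ours.

```python
PAGE_BYTES = 512          # VAX page size
ELEM_BYTES = 4            # assumed element size (single precision)
ELEMS_PER_PAGE = PAGE_BYTES // ELEM_BYTES
SYSTEM_PAGES = 149        # PVWS - DP in both tables

def add_data_pages(n):
    """One n x n array plus one length-n vector (assumed layout for ADD)."""
    return (n * n + n) // ELEMS_PER_PAGE

def mad_data_pages(n):
    """Three n x n arrays (assumed layout for MAD)."""
    return 3 * n * n // ELEMS_PER_PAGE

# Print size, data pages, and PVWS for every tabulated version.
for n in range(128, 1025, 128):
    dp = add_data_pages(n)
    print("ADD", n, dp, dp + SYSTEM_PAGES)
for n in range(128, 769, 128):
    dp = mad_data_pages(n)
    print("MAD", n, dp, dp + SYSTEM_PAGES)
```

Under these assumptions the computed values agree with every DP and PVWS entry in both tables.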

Table 5   Execution Statistics for Programs MAD and TMAD

                 Time (min.)               Pagefaults
Version   Size      MAD     TMAD   Ratio       MAD     TMAD   Ratio
      1    128     .086     .078    1.09       871      692    1.26
      2    256     .199     .167    1.19      3248     1945    1.67
      3    384     1.69     .512    3.30      8467     3804    2.23
      4    512    17.14     .903   18.70   1427244     6777   210.6
      5    640    52.55     1.28   41.20   2456614    10472   234.6
      6    768   141.43     2.18   64.88   3539429    14930   237.1

Table 6   Time Per Job for Multiprogramming ADD2 and TADD2

MPL   ADD2 t (min.)   TADD2 t (min.)   Ratio   Total PVWS
  1            .116             .086    1.35          663
  2            .109             .083    1.31         1326
  3            .114             .089    1.28         1989
  4            .113             .091    1.24         2652
  5            .115             .088    1.31         3315
  6            .111             .083    1.25         3978

Table 7   Time Per Job for Multiprogramming ADD3 and TADD3

MPL   ADD3 t (min.)   TADD3 t (min.)   Ratio   Total PVWS
  1            .236             .144    1.64         1304
  2            .268             .178    1.51         2608
  3            .252             .160    1.58         3912
  4            .263             .162    1.62         5216
  5            .273             .170    1.61         6520
  6            .276             .169    1.63         7824

Table 8   Time Per Job for Multiprogramming MAD2 and TMAD2

MPL   MAD2 t (min.)   TMAD2 t (min.)   Ratio   Total PVWS
  1            .201             .168    1.20         1685
  2            .591             .216    2.74         3370
  3            .534             .246    2.17         5055
  4            .932             .244    3.82         6740
  5            .822             .281    2.93         8425
  6            .819             .283    2.89        10110

Figure 1   Turnaround Time Ratio for Original and Transformed ADD

Figure 2   Turnaround Time Ratio for Original and Transformed MAD

Figure 3   Throughput vs. MPL Model Curve

Figure 4   Time/Job vs. MPL for Programs ADD2 and TADD2

Figure 5   Time/Job vs. MPL for Programs ADD3 and TADD3